Abstract

Many modern data mining applications are concerned with the analysis of datasets in
which the observations are described by paired high-dimensional vectorial representations or
"views". Typical examples can be found in web mining and genomics applications. In
this article we present an algorithm for data clustering with multiple views, Multi-View
Predictive Partitioning (MVPP), which relies on a novel criterion of predictive similarity between
data points. We assume that, within each cluster, the dependence between multivariate views
can be modelled using a two-block partial least squares (TB-PLS) regression model, which
performs dimensionality reduction and is particularly suitable for high-dimensional settings.
The proposed MVPP algorithm partitions the data such that the within-cluster predictive
ability between views is maximised. The proposed objective function depends on a
measure of predictive influence of points under the TB-PLS model, which has been derived as an
extension of the PRESS statistic commonly used in ordinary least squares regression.
Using simulated data, we compare the performance of MVPP to that of competing multi-view
clustering methods which rely upon geometric structures of points but ignore the
predictive relationship between the two views. State-of-the-art results are obtained on benchmark web
mining datasets.

1 Introduction 

In recent years, an increasing number of data mining applications have arisen which deal with
the problem of finding patterns in data points for which several blocks of high-dimensional
measurements, or "views", have been obtained. Each view generally provides a different quantitative
representation of the available random samples. For instance, in genomics, one view may represent
the expression levels of all genes in the genome, and the paired view may represent the genetic
alterations in each gene observed on the same biological samples [31]; in web mining applications,
web pages may be represented by the text content of the page and the hyperlinks pointing to it
[3]. The fast-developing area of multi-view learning [5] is concerned with how to make combined
use of the information provided by the available views in order to perform specific data mining
tasks, such as clustering and predictive modelling. 

A recurrent objective is that of detecting naturally occurring data clusters. When multiple 
views are available, there may be reasons to believe that the observations should cluster in the 
same way under each of the available views, hence the process of inferring the true clustering 
can be improved by making joint use of all views. A growing number of multi-view clustering 
algorithms have been proposed, and two main approaches seem to have emerged in the literature: 
late and early fusion methods. The late fusion methods first recover the clusters independently 
from each view (e.g. by using the K-means algorithm), and then attempt to infer a "consensus" 
clustering by combining the partitioning obtained within each view such that some measure of 
disagreement between the individual partitionings is minimised [T7J [TH1 E] ■ 

On the other hand, early fusion methods start by learning any common patterns that may 
be shared by the views and that could yield a joint clustering [3] [29] [9] [10]. A common
assumption is that the data under each view are generated from a mixture of distributions where the
mixing proportions which determine the cluster assignments are unknown but shared between
views. Several methods rely on a two-step approach where the clusters are ultimately defined in
terms of geometric separation between points lying in low-dimensional projections: first, a joint
dimensionality reduction step is performed using both views, generally using canonical
correlation analysis (CCA), which recovers latent factors explaining the correlation between the views;
second, K-means is used to detect the clusters among data points in the lower dimensional space 
found in the first stage. Using CCA to initially perform dimensionality reduction has been shown 
to increase the separation between the clusters [9], and some non-linear extensions have also been 
explored [TU] . 

These multi-view clustering methods have become particularly popular for a broad class of 
problems in web mining where data of many different types may co-occur on the same page, for 
example text, hyperlinks, images and video. A common application is that of clustering web pages 
[5]. Pages can be clustered together based on the similarity of both their text content and link 
structure, and the clusters identify broad subject categories. Another problem in the web mining 
domain involves clustering and annotating images represented as a vector of pixel intensities as
well as a bag of words containing the corresponding textual annotation [6]. The ability to accurately
cluster these data has implications for web search and advertising.

A different multi-view learning scenario arises when one representation of the data is treated
as a high-dimensional predictor or "explanatory" vector, and the paired view represents a
high-dimensional "response" vector. The task then consists of fitting a regression model such that, when
a new observation has been observed under the explanatory view, the corresponding representation
of that observation in the response view can be optimally predicted.

In settings where both the explanatory and response views are high-dimensional, two-block
partial least squares (TB-PLS) regression has proved to be a particularly useful method for modelling
a linear predictive relationship between two high-dimensional views [30] [25]. TB-PLS performs
dimensionality reduction in both predictor and response views simultaneously by assuming that 
each multivariate representation of the data can be factorized in a set of mutually orthogonal 
latent factors that maximise the covariance between the views. This regression model overcomes
problems relating to multicollinearity by estimating least squares regression coefficients using the 
lower dimensional projections. Among other applications, the model has been successfully used in 
genomics, where regions of DNA that are highly predictive of gene expression need to be detected 
[18], and in computational finance, where the returns of several indices have been predicted using
a large basket of assets [31] . 

In this article we also assume that two views are available, and set out to discover any potential 
partitions of the data by using them jointly. Whereas other multi-view clustering approaches rely 
on geometrical structures, we assume that any two points should be assigned to the same cluster 
if they both appear to be modelled equally well by the same regression model. In this respect, 
multi-view clustering is framed as a problem of learning the unknown number of multi-response 
regression models in high-dimensions. This is accomplished by first introducing a novel criterion for 
quantifying the predictive influence of an observation under a TB-PLS regression model, whereby 
a data point is deemed unusual for the model if it has high predictive influence. The rationale is 
that, under a given cluster-specific regression model, any unusual observation should be removed 
from that cluster and allocated to a different one. 

The article is organised as follows. In Section [2] we review the two-block PLS regression model,
and describe the problem of modelling heterogeneous data. In Section [3] we introduce a measure
of predictive influence for TB-PLS regression, and address the predictive partitioning problem; an
objective function is first proposed, and an iterative multi-view predictive partitioning (MVPP)
algorithm is then presented. To the best of our knowledge, no other multi-view predictive clustering
algorithm has been proposed in the literature. In Section [4] we describe a number of Monte
Carlo simulation settings that will be used to illustrate the performance of the proposed methods
under different scenarios. The results, as well as comparisons to alternative multi-view clustering
algorithms, are then presented in Section [5]. The applications to real web pages and academic paper
clustering in Section [6] demonstrate the performance of the algorithm on real data. Concluding
remarks are found in the final section.

2 High-dimensional multi-response regression 
2.1 Two block partial least squares regression 

Suppose we have observed a random sample of n independent and identically distributed data
points, $\{x_i, y_i\}$, for $i = 1, \ldots, n$, where each $x_i \in \mathbb{R}^{1 \times p}$ is the "explanatory view" and each
$y_i \in \mathbb{R}^{1 \times q}$ is the "response view" observed on the i-th sample. The dimensions of both views,
p and q, are allowed to be very large. The n observations can then be arranged in two paired
matrices, one containing all the explanatory variables observed in the samples, $X \in \mathbb{R}^{n \times p}$, and
one containing the corresponding response variables, $Y \in \mathbb{R}^{n \times q}$. The variables in both views are
centred and scaled so as to have zero mean and unit variance.

The TB-PLS regression model assumes that the predictor and response views are noisy
realisations of linear combinations of hidden variables, or latent factors. The specific form of the TB-PLS
model is given in the following definition. 
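Definition 1. The TB-PLS model assumes the existence of R pairs of orthogonal latent factors,
$t^{(r)} \in \mathbb{R}^{n \times 1}$ and $s^{(r)} \in \mathbb{R}^{n \times 1}$, for $r = 1, \ldots, R$, such that

$$X = \sum_{r=1}^{R} t^{(r)} p^{(r)\top} + E_x, \qquad Y = \sum_{r=1}^{R} s^{(r)} q^{(r)\top} + E_y, \qquad (1)$$

where $E_x \in \mathbb{R}^{n \times p}$ and $E_y \in \mathbb{R}^{n \times q}$ are matrices of residuals. For each r, the latent factors are
$t^{(r)} = X u^{(r)}$ and $s^{(r)} = Y v^{(r)}$, where $u^{(r)} \in \mathbb{R}^{p \times 1}$ and $v^{(r)} \in \mathbb{R}^{q \times 1}$ are weight vectors of unit
length. The vectors $p^{(r)} \in \mathbb{R}^{p \times 1}$ and $q^{(r)} \in \mathbb{R}^{q \times 1}$ are the factor loadings.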

For any given r, each pair of latent factors $\{t^{(r)}, s^{(r)}\}$ provides a one-dimensional representation
of both views and is obtained by identifying the directions on which the projected views have
maximal covariance. Therefore, the paired latent factors satisfy the property that

$$\mathrm{Cov}(t^{(r)}, s^{(r)})^2 = \max_{u^{(r)}, v^{(r)}} \mathrm{Cov}(X u^{(r)}, Y v^{(r)})^2, \qquad (2)$$

under the constraints that $\|u^{(r)}\| = \|v^{(r)}\| = 1$ for all $r = 1, \ldots, R$. For $r = 1$ this optimisation
problem is equivalent to

$$\lambda^{(1)} = t^{(1)\top} s^{(1)} = \max_{u^{(1)}, v^{(1)}} u^{(1)\top} X^\top Y v^{(1)} \qquad (3)$$

under the same constraints posed on the weights. Here, $\lambda^{(1)}$ is the largest singular value of $X^\top Y$
and the weights are the corresponding left and right singular vectors. The R weight vectors that
satisfy Eq. (2) can then be found by computing the singular value decomposition (SVD) of $X^\top Y$,
that is $X^\top Y = U \Lambda V^\top$, where $U = [u^{(1)}, \ldots, u^{(p)}] \in \mathbb{R}^{p \times p}$ and $V = [v^{(1)}, \ldots, v^{(q)}] \in \mathbb{R}^{q \times q}$ are
orthonormal matrices whose columns are the left and right singular vectors of $X^\top Y$, respectively.
$\Lambda \in \mathbb{R}^{p \times q}$ is a diagonal matrix whose entries are the ordered singular values of $X^\top Y$. Therefore
$u^{(r)}$ and $v^{(r)}$ are taken to be the r-th left and right singular vectors of $X^\top Y$, respectively.
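
As an illustration of the SVD step described above, the following is a minimal NumPy sketch rather than the authors' implementation. The function name `tbpls_weights` and the toy data are assumptions made for this example only. It extracts the first R pairs of weight vectors from the SVD of $X^\top Y$ together with the corresponding latent factors; for the first pair, $t^{(1)\top} s^{(1)}$ should coincide with the largest singular value.

```python
import numpy as np

def tbpls_weights(X, Y, R=1):
    """Return weights (U_R, V_R), latent factors (T, S) and singular values.

    X is n x p and Y is n x q; both are assumed to be column-centred.
    """
    U, lam, Vt = np.linalg.svd(X.T @ Y, full_matrices=False)
    U_R, V_R = U[:, :R], Vt[:R].T      # unit-length weight vectors u, v
    T, S = X @ U_R, Y @ V_R            # latent factors t = Xu, s = Yv
    return U_R, V_R, T, S, lam[:R]

# Toy data (hypothetical): a noisy linear relationship between two views.
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 6))
Y = X @ rng.standard_normal((6, 4)) + 0.1 * rng.standard_normal((100, 4))
X, Y = X - X.mean(axis=0), Y - Y.mean(axis=0)

U_R, V_R, T, S, lam = tbpls_weights(X, Y, R=2)
```

Note that the signs of singular vectors are arbitrary, so individual weights may differ in sign between runs or libraries while the products used later remain unchanged.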

The predictive relationship between the two views is driven by a linear regression model
involving the R pairs of latent factors. For each r, the response latent variable depends on the
explanatory latent variable, as follows

$$s^{(r)} = t^{(r)} g^{(r)} + h^{(r)}, \qquad (4)$$

where each $g^{(r)}$ is a scalar regression coefficient which describes the projection of the latent factor
relating to the response onto the latent factor relating to the predictors, and each $h^{(r)} \in \mathbb{R}^{n \times 1}$
is the vector of residual errors. Since the latent factors are assumed to have zero mean, there
is no intercept term. Using the inner regression models of Eq. (4), the TB-PLS model can now be
re-written in the more familiar form

$$Y = X \sum_{r=1}^{R} u^{(r)} g^{(r)} q^{(r)\top} + E = X\beta + E, \qquad (5)$$

where the regression coefficients have been defined as

$$\beta = \sum_{r=1}^{R} u^{(r)} g^{(r)} q^{(r)\top} \qquad (6)$$

and depend on the parameter sets $\Theta^{(r)} = \{u^{(r)}, v^{(r)}, g^{(r)}, q^{(r)}\}$, with $r = 1, \ldots, R$. Each one of
the R factor loadings is obtained by performing univariate regressions,

$$q^{(r)} = \frac{Y^\top s^{(r)}}{s^{(r)\top} s^{(r)}}, \qquad (7)$$

and each of the R regression coefficients $g^{(r)}$, from the inner model of Eq. (4), is estimated by
least squares regression of $s^{(r)}$ on $t^{(r)}$, so that

$$g^{(r)} = \left( t^{(r)\top} t^{(r)} \right)^{-1} t^{(r)\top} s^{(r)}. \qquad (8)$$
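
To make the estimation steps concrete, the sketch below assembles the regression coefficients from the weights, loadings and inner coefficients. It is a minimal illustration under stated assumptions, not the authors' code: the name `fit_tbpls` and the simulated one-factor data are hypothetical, the views are assumed to be centred, and the loading is computed as the univariate least squares regression of Y on the response latent factor.

```python
import numpy as np

def fit_tbpls(X, Y, R=1):
    """Estimate beta = sum_r u g q^T from centred X (n x p) and Y (n x q)."""
    U, _, Vt = np.linalg.svd(X.T @ Y, full_matrices=False)
    beta = np.zeros((X.shape[1], Y.shape[1]))
    for r in range(R):
        u, v = U[:, r], Vt[r]
        t, s = X @ u, Y @ v              # latent factors
        q = Y.T @ s / (s @ s)            # Y loading (univariate regressions)
        g = (t @ s) / (t @ t)            # inner coefficient: regress s on t
        beta += g * np.outer(u, q)       # one term of the sum defining beta
    return beta

# Hypothetical one-factor data: s depends linearly on t, plus noise.
rng = np.random.default_rng(1)
n, p, q = 200, 5, 4
t = rng.standard_normal(n)
s = 0.9 * t + 0.1 * rng.standard_normal(n)
u = rng.standard_normal(p); u /= np.linalg.norm(u)
v = rng.standard_normal(q); v /= np.linalg.norm(v)
X = np.outer(t, u) + 0.05 * rng.standard_normal((n, p))
Y = np.outer(s, v) + 0.05 * rng.standard_normal((n, q))
X, Y = X - X.mean(axis=0), Y - Y.mean(axis=0)

beta = fit_tbpls(X, Y, R=1)
rss = np.sum((Y - X @ beta) ** 2)        # residual sum of squares
tss = np.sum(Y ** 2)                     # total sum of squares
```

Because each term is a product of u, g and q, a sign flip in either singular vector cancels out and the fitted coefficients are unaffected.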

In high-dimensional settings, such as the one we consider, it is generally appropriate to assume 
the data has spherical covariance within each view [9], and so $X^\top X = I_p$ and $Y^\top Y = I_q$.
Although clearly incorrect in many real-world applications, this assumption has been shown to
provide better results, particularly in classification problems, than attempting to estimate the
true covariance matrices, especially when $p, q \gg n$ [2]. This can be seen as an extreme form of
regularisation which introduces a large bias to reduce the variance in the estimated parameters 
and has been widely used in applications involving genomic data [551 IHJ H31 US] • 

2.2 Modelling heterogeneous data 

The TB-PLS regression model rests on the assumption that the n independent samples are
representative of a single, homogeneous population. Under this assumption, the latent factors that
determine the regression coefficients in Eq. (6) can be optimally estimated using all the
available data. However, in many applications the observations may be representative of a number
of different populations, each one characterised by a different between-views covariance structure.
Failing to recognise this would lead to a biased estimation of the latent factors, which would in 
turn provide a sub-optimal predictive model. 

We are interested in situations in which the observations have been sampled from K different
sub-populations, where the exact value of K may be unknown. It can be noted that in general
the optimal dimension $R_k$ is not necessarily the same across clusters. The problem involves
simultaneously recovering the cluster assignments and their parameter sets, as well as learning the
optimal K. Learning the optimal dimensionality in each cluster is a much harder problem which
we address later.

A simple illustrative example is given in Figure 1(a), where $K = 2$, $R_1 = 2$, $R_2 = 1$ and
$p = 3$. Here, under the X view, the points are uniformly distributed along either one of two lower
dimensional subspaces, a line and a plane, both embedded in the three-dimensional space. To 
generate data points under the Y view that can be linearly predicted using the explanatory view, 
we take a linear combination of variables in the explanatory view and add some Gaussian noise. 



Clearly, fitting a global TB-PLS model would be inappropriate here, as shown in Figure 1(b) 
which shows that the estimated subspaces differ from the true ones, so the predictive ability of 
the model is sub-optimal. We will revisit this example in Section [5] and show that our multi-view
clustering algorithm recovers the true subspaces, as in Figure [6].
3 Predictive partitioning



3.1 A PRESS-based predictive influence measure 

The issue of detecting influential observations has been extensively studied in the context of OLS 
regression [1] [22]. A common approach is based on examining the elements of the (n x n) "hat
matrix", $G = X(X^\top X)^{-1} X^\top$. The term $G_{ii}$ is known as the leverage of the i-th point, and
determines the contribution of the i-th point in estimating its associated response. The partial
leverage, $G_{ij}$, is a related quantity which gives the contribution of the j-th point for estimating the
response associated with the i-th point. These quantities are used to detect influential observations
which have a larger contribution to the estimated responses relative to other points. However, 
G does not take into account any information from Y and so these leverage terms alone are not 
always sufficient to determine which observations are influential £Q. 

After fitting the regression model, a seemingly obvious way to identify influential observations 
might be to examine the residual error. However, it has been observed that points which exert a 
large leverage on the regression may obtain relatively smaller residual errors compared to other 
points as a result of over-fitting [23] . 

A more effective approach to assessing the influence of a particular observation considers the 
effects of its removal from the regression model. This involves estimating the regression parameter 
n times, leaving out each observation in turn, and then evaluating the prediction error on the 
unused observation. If we let $\beta_{-i}$ be the OLS regression coefficient estimated by using all but the
i-th observation, the corresponding leave-one-out (LOO) error is $e_{-i} = y_i - x_i \beta_{-i}$. An observation
can then be labelled as influential if its LOO error is particularly large. The choice of threshold 
for identifying an observation as influential is an open question and many strategies have been 
suggested in the literature pQ. 

The approach above is related to the leave-one-out cross validation (LOOCV) error, which is
often used to quantify the predictive performance of a regression model [26], and is defined as the
mean of the squared leave-one-out prediction errors $e_{-i} = y_i - x_i \beta_{-i}$,

$$J = \frac{1}{n} \sum_{i=1}^{n} \left\| e_{-i} \right\|^2. \qquad (9)$$

For OLS, it is well known that each prediction error featuring in Eq. (9) can be computed without
the need to remove an observation and re-fit the regression model. This can be accomplished
through a closed-form expression known as the PRESS [1], which gives

$$J = \frac{1}{n} \sum_{i=1}^{n} \left\| \frac{e_i}{1 - G_{ii}} \right\|^2, \qquad (10)$$

where $e_i = y_i - x_i \beta$ is the i-th residual of the model fitted to all n observations. In this form,
the i-th leave-one-out residual can be seen as the i-th residual, scaled by one
minus its leverage, $G_{ii}$. Since the PRESS only depends on quantities estimated using least squares
it has a computational cost in the order of a single least squares fit and, as such, is extremely
efficient to compute.
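
The identity underlying the PRESS for OLS can be checked numerically. The sketch below is illustrative only: the function names are hypothetical, the response is taken to be univariate, and X is assumed to have full column rank. It computes the leave-one-out residuals once through the hat matrix and once by explicitly refitting the model n times.

```python
import numpy as np

def loo_residuals_closed_form(X, y):
    """Leave-one-out residuals from a single OLS fit via the hat matrix."""
    G = X @ np.linalg.solve(X.T @ X, X.T)    # hat matrix G = X (X^T X)^{-1} X^T
    e = y - G @ y                            # ordinary residuals
    return e / (1.0 - np.diag(G))            # divide by one minus the leverage

def loo_residuals_refit(X, y):
    """Leave-one-out residuals obtained by refitting OLS n times."""
    n = X.shape[0]
    out = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i
        beta, *_ = np.linalg.lstsq(X[keep], y[keep], rcond=None)
        out[i] = y[i] - X[i] @ beta
    return out

rng = np.random.default_rng(0)
X = rng.standard_normal((40, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.standard_normal(40)

fast = loo_residuals_closed_form(X, y)
slow = loo_residuals_refit(X, y)
press = np.mean(fast ** 2)                   # mean squared LOO error
```

The two routes agree up to floating-point error, while the closed form needs only one fit.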

In previous studies, the PRESS has also been used for identifying influential observations in the
context of PLS regression with univariate responses [32] [19]. However, in practice its computation
requires the regression model to be fit n times, each time using $n - 1$ data points. A similar
strategy for the evaluation of the PRESS in a TB-PLS model, when the response is multivariate,
would require n SVD computations, each one having a computational cost of $O(p^2 q + q^2 p)$ [12].
This approach is particularly expensive when the dimension of the data in either view is large, as
in our settings.

Recently, we proposed a closed-form expression for computing the PRESS statistic under a
TB-PLS model which reduces the computational cost of explicitly evaluating the leave-one-out errors
[20]. We overcome the need to recompute the SVD n times by approximating the leave-one-out
estimates of the singular vectors $\{u_{-i}, v_{-i}\}$ with $\{u, v\}$.



Definition 2. A closed-form approximation for the PRESS in Eq. (9) in a TB-PLS model is

1 



i=l 
(11)



where $e_i = y_i - x_i \beta$ is the TB-PLS residual error, and $b_i = h_i s_i y_i$, with $h_i = s_i - g t_i$ being the
i-th residual error for the inner regression model of Eq. (4) and $E_{y,i} \in \mathbb{R}^{1 \times q}$ being the i-th residual
in the TB-PLS model in Eq. (1).

The derivation of Eq. (11) is provided in Appendix A. The error introduced by approximating
the leave-one-out estimates of the singular vectors is of order $O\left(\sqrt{\log(n)/n}\right)$. The denominator of
Eq. (11) is a scaling term related to the contribution of each data point to the latent factors, t
and s. In this form, it can be seen that the TB-PLS PRESS has similarities with the PRESS for
OLS regression in Eq. (10), where these scaling terms are related to the leverage each point exerts
on the regression [1].

Using Eq. (11), we now consider how to measure the influence each point exerts on the TB-PLS
model. Since we are interested in the predictive performance of the TB-PLS model, we aim to
identify influential points as those observations having the greatest effect on the prediction error.
In order to quantify this effect, we define the predictive influence of an observation $\{x_i, y_i\}$ as the
rate of change of the PRESS at that point whilst all other observations remain constant.

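Definition 3. The predictive influence of a data point $\{x_i, y_i\}$, which we denote as
$\pi_{x,y}(x_i, y_i) \in \mathbb{R}^{(p+q) \times 1}$, is the total derivative of the PRESS with respect to the p variables in $x_i$ and the q
variables in $y_i$,

$$\pi_{x,y}(x_i, y_i) = \left[ \frac{\partial J}{\partial x_{i,1}}, \ldots, \frac{\partial J}{\partial x_{i,p}}, \frac{\partial J}{\partial y_{i,1}}, \ldots, \frac{\partial J}{\partial y_{i,q}} \right]^\top. \qquad (12)$$
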

The closed-form expression for the computation of this quantity is reported in Appendix B.
The predictive influence offers a way of measuring how much the prediction error would increase
in response to an incremental change in the observation $\{x_i, y_i\}$ or, alternatively, the sensitivity
of the prediction error with respect to that observation. The rate of change of the PRESS at this
point is given by the magnitude of the predictive influence vector, $\|\pi_{x,y}(x_i, y_i)\|_2$. If this quantity
is large, a small change in the observation will result in a large change in the prediction
error relative to other points. In this case, removing such a point from the model would cause a
large improvement in the prediction error. We can then identify influential observations as those
for which the increase in the PRESS is large, relative to other observations.
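
The definition can be explored numerically on small problems. The sketch below is a stand-in under stated assumptions and not the closed-form expression of Appendix B: it evaluates the PRESS of a one-factor TB-PLS model by explicit leave-one-out refitting (without re-centring the data) and approximates its gradient with respect to one observation by central finite differences. All function names and the toy data are hypothetical.

```python
import numpy as np

def tbpls_beta(X, Y):
    """One-factor TB-PLS coefficients beta = u g q^T."""
    U, _, Vt = np.linalg.svd(X.T @ Y, full_matrices=False)
    u, v = U[:, 0], Vt[0]
    t, s = X @ u, Y @ v
    g = (t @ s) / (t @ t)
    q = Y.T @ s / (s @ s)
    return g * np.outer(u, q)

def press_loo(X, Y):
    """PRESS by explicit leave-one-out refitting (slow; for illustration)."""
    n = X.shape[0]
    total = 0.0
    for i in range(n):
        keep = np.arange(n) != i
        beta = tbpls_beta(X[keep], Y[keep])
        total += np.sum((Y[i] - X[i] @ beta) ** 2)
    return total / n

def predictive_influence_fd(X, Y, i, eps=1e-5):
    """Finite-difference gradient of the PRESS w.r.t. the i-th pair (x_i, y_i)."""
    p = X.shape[1]
    Z = np.hstack([X, Y])
    grad = np.zeros(Z.shape[1])
    for j in range(Z.shape[1]):
        Zp, Zm = Z.copy(), Z.copy()
        Zp[i, j] += eps
        Zm[i, j] -= eps
        grad[j] = (press_loo(Zp[:, :p], Zp[:, p:])
                   - press_loo(Zm[:, :p], Zm[:, p:])) / (2 * eps)
    return grad

# Hypothetical toy data with a single shared latent factor.
rng = np.random.default_rng(2)
t = rng.standard_normal(30)
X = np.outer(t, [0.6, 0.0, 0.8]) + 0.1 * rng.standard_normal((30, 3))
Y = np.outer(0.9 * t, [0.8, 0.6]) + 0.1 * rng.standard_normal((30, 2))

pi_0 = predictive_influence_fd(X, Y, i=0)
magnitude = np.linalg.norm(pi_0)         # rate of change of the PRESS at point 0
```

This brute-force version costs many SVDs per point, which is exactly the expense that a closed-form expression avoids.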

In the remainder of this Section we develop further the idea of using the predictive influence 
measure for multi-view clustering. 



3.2 The MVPP clustering algorithm 

Initially we assume that the number of clusters, K, is known. As mentioned in Section [2.2], we
want to allocate each observation $\{x_i, y_i\}$, $i = 1, \ldots, n$, into one of K non-overlapping clusters
$\{C_1, \ldots, C_K\}$ such that each cluster contains exactly $n_k$ observations, with $\sum_{k=1}^{K} n_k = n$, and
these points are as similar as possible in a predictive sense. Accordingly, we first define a suitable
objective function to be minimised.

Definition 4. The within-clusters sum of predictive influences is

$$\mathcal{C}(\Theta, C) = \sum_{k=1}^{K} \sum_{i \in C_k} \left\| \pi^{k}_{x,y}(x_i, y_i) \right\|^2, \qquad (13)$$

where $\pi^{k}_{x,y}(x_i, y_i)$ is the predictive influence of a point $\{x_i, y_i\}$ under the k-th TB-PLS model.



Clearly, when Eq. (13) is minimised, each cluster consists of points that exert minimal
predictive influence for that specific TB-PLS model, and therefore the overall prediction error is
minimised. We refer to these optimal clusters as predictive clusters. If the true cluster assign- 
ments were known a priori, fitting these models and thus minimising the objective function would 
be trivial. However, since the true partitioning is unknown, there is no analytic solution to this 
problem, and we resort to an iterative algorithm that alternates between finding optimal clus- 
ter assignments and optimal model parameters. Specifically, the algorithm we suggest alternates 
between the following two steps: 

1. Given K TB-PLS models with parameters $\{\Theta_1, \ldots, \Theta_K\}$, and keeping these fixed, find the
cluster assignments which solve

$$\min_{\{C_1, \ldots, C_K\}} \mathcal{C}(\Theta, C). \qquad (14)$$

2. Given a set of cluster assignments, $\{C_1, \ldots, C_K\}$, and keeping these fixed, estimate the
parameters of the K predictive models which solve

$$\min_{\{\Theta_1, \ldots, \Theta_K\}} \mathcal{C}(\Theta, C). \qquad (15)$$

We summarise the entire algorithm below. 

Initialisation (I): At iteration $\tau = 0$, an initial, random partitioning of the data, $\{C_1, \ldots, C_K\}$,
is generated; both the TB-PLS model parameters, $\{\Theta_1, \ldots, \Theta_K\}$, and the predictive influences,
$\pi^{k}_{x,y}(x_i, y_i)$, are computed for all K clusters and n observations.

At each subsequent iteration $\tau = 1, 2, \ldots$ the following two steps are performed in sequence
until convergence:
Partitioning (P): Keeping the model parameters fixed, the cluster assignments that minimise
(14) are obtained by assigning each point to the cluster for which its predictive influence is smallest,

$$C_k \leftarrow \left\{ i : \min_{k} \left\| \pi^{k}_{x,y}(x_i, y_i) \right\| \right\}. \qquad (16)$$

Estimation (E): Keeping the cluster allocations fixed, the parameters $\{\Theta_1, \ldots, \Theta_K\}$ that
minimise (15) are estimated using the data points $\{x_i, y_i\}$ for all $i \in C_k$ and according to Eqs.
(3), (7) and (8).

The computational cost of each iteration of the MVPP algorithm is of the order of computing
an SVD in each cluster, $O(K(p^2 q + q^2 p))$.
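
The alternating structure of the I, P and E steps can be sketched generically. The code below is only a skeleton under loudly stated assumptions: `alternate_partition`, `ols_fit` and `squared_error` are hypothetical names, and the per-point cost used here is the squared OLS prediction error, which is a stand-in chosen so that the example runs and is not the predictive influence used by MVPP. In MVPP itself the `fit` callable would estimate a TB-PLS model and the `cost` callable would return the squared magnitude of the predictive influence.

```python
import numpy as np

def alternate_partition(X, Y, K, fit, cost, n_iter=50, seed=0):
    """Alternate between reassigning points (P) and refitting models (E)."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    labels = rng.permutation(n) % K          # (I) random, non-empty clusters
    models = [fit(X[labels == k], Y[labels == k]) for k in range(K)]
    history = []
    for _ in range(n_iter):
        # (P) assign each point to the model under which its cost is smallest
        costs = np.column_stack([cost(m, X, Y) for m in models])
        new_labels = costs.argmin(axis=1)
        history.append(costs[np.arange(n), new_labels].sum())
        if np.array_equal(new_labels, labels):
            break                            # assignments are stable
        labels = new_labels
        # (E) refit each model on its own cluster; keep old model if empty
        models = [fit(X[labels == k], Y[labels == k]) if np.any(labels == k)
                  else models[k] for k in range(K)]
    return labels, models, history

def ols_fit(Xk, Yk):                         # stand-in for a TB-PLS fit
    beta, *_ = np.linalg.lstsq(Xk, Yk, rcond=None)
    return beta

def squared_error(beta, X, Y):               # stand-in for predictive influence
    return np.sum((Y - X @ beta) ** 2, axis=1)

# Hypothetical data: two groups with different linear relationships.
rng = np.random.default_rng(3)
X = rng.standard_normal((80, 3))
truth = np.repeat([0, 1], 40)
B = [rng.standard_normal((3, 2)), rng.standard_normal((3, 2))]
Y = np.vstack([X[truth == k] @ B[k] for k in (0, 1)])
Y += 0.05 * rng.standard_normal(Y.shape)

labels, models, history = alternate_partition(X, Y, 2, ols_fit, squared_error)
```

With this stand-in, each step can only lower the recorded objective because the least squares fit minimises the same cost that drives the reassignment. Whether the analogous property holds when TB-PLS and the predictive influence are used is the question addressed in the convergence discussion.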

In order for the algorithm to converge to a local minimum we require that at each P step
and each E step, the objective function must be decreasing. In Step P we assign observations to
clusters based on the assignment rule (16), which minimises the predictive influence, and so this
decreases the objective function by definition. In Step E we do not directly seek to minimise the
predictive influence; instead we estimate parameters $\Theta_k^{new}$ in each cluster using TB-PLS. In order
for these parameters to decrease the objective function it must be the case that they are closer to
the optimal MVPP parameters, $\Theta_k^{*}$, than the parameters estimated using TB-PLS at the previous
iteration, $\Theta_k^{old}$.

To see why this will be the case, we must consider what happens when points are reassigned to
clusters. A large magnitude predictive influence is assigned to points which are influential under a
given TB-PLS model. Therefore in Step E, points which have been newly assigned to cluster k will
be influential under the TB-PLS model, relative to other points in that cluster. If we estimate
a new TB-PLS model, $\Theta_k^{new}$, these points will be assigned a smaller magnitude predictive influence
and so the sum of squared predictive influences within each cluster will be decreased. The algorithm
converges to a local minimum of the objective function for any initial cluster configuration.



3.3 Model selection 

Model selection in both clustering and TB-PLS are challenging problems which have previously 
only been considered separately. Within our framework, the PRESS statistic provides a robust 
method for efficiently evaluating the fit of the TB-PLS models to each cluster. A straightforward 
application of the PRESS allows us to identify the optimal number of clusters, K. We also 
apply a similar intuition to attempt to learn the number of latent factors of each TB-PLS model,
$R_1, \ldots, R_K$.

Since our algorithm aims to recover predictive relationships on subsets of the data, the number 
of clusters is inherently linked to its predictive performance. If K is estimated correctly, the
resulting prediction error should be minimised since the correct model has been found. We therefore
propose a method to select the number of clusters by minimising the out-of-sample prediction
error, which overcomes the issue of over-fitting as we increase K. The strategy consists of running
the MVPP algorithm using values of K between 1 and some maximum value, $K_{\max}$. We then
select the value of K for which the mean PRESS value is minimised. This is possible due to our
computationally efficient formulation of the PRESS for TB-PLS and the fact that we aim to
recover clusters which are maximally predictive. The performance of this approach using simulated
data is discussed in Section [5.3].

In the case where there is little noise in the data, the number of latent factors can be learned
by simply evaluating the PRESS in each cluster at each iteration. Therefore, in the k-th cluster,
the value of $R_k$ is selected such that the PRESS is minimised. Since we select the value of $R_k$
which minimises the PRESS, this is also guaranteed to decrease the objective function. However, as
the amount of noise in the data increases, selecting each optimal $R_k$ value becomes a more difficult
task due to the iterative nature of the algorithm. In this case, setting $R = 1$ tends to capture
the important predictive relationships which define the clusters whereas increasing each $R_k$ can
actually be detrimental to clustering performance. This issue is discussed in Section [5.5].

4 Monte Carlo simulation procedures 

4.1 Overview 

In order to evaluate the performance of predictive partitioning and compare it to other multi-view
clustering methods, we devise two different simulation settings which are designed to highlight
situations where current approaches to multi-view clustering are expected to succeed and
situations where they are expected to fail.

Commonly, clusters are considered to be formed by geometrically distinct groups of data points. 
This notion of geometric distance is also encountered implicitly in mixture models. Separation 
conditions have been developed for the exact recovery of mixtures of Gaussian distributions, for 
instance, for which the minimum required separation between means of the clusters is proportional 
to the cluster variances [T3J |H] • 

In scenario A, we construct clusters according to the assumption that data points have a 
similar geometric structure under both views which should be recovered by existing multi-view 
clustering algorithms. We assess the performance as a function of the signal to noise ratio. As the 
level of noise is increased, the between-cluster separation diminishes to the point that all clusters 
are undetectable using a notion of geometric distance whereas a clustering approach based on 
predictive influence is expected to be more robust against noise. On the other hand, under 
scenario B the clustering of data points is not defined by geometric structures. We simulate data 
under cluster-wise regression models where the geometric structure is different in each view. In 
this situation, clustering based on geometric separation is expected to perform poorly regardless 
of the signal to noise ratio. In all of these settings we set the number of latent factors, R = 1 and 
the number of clusters, K = 2. A detailed description of these two settings is given below. 

4.2 Scenario A: geometric clusters 

The first simulation setting involves constructing K geometric clusters (up until the addition of
noise). We simulate each pair of latent factors $t^{(k)}$ and $s^{(k)}$, with $k = 1, \ldots, K$, from a bivariate
normal distribution. Each element $i = 1, \ldots, n_k$, where $n_k = 50$, is simulated as
$(t_i^{(k)}, s_i^{(k)}) \sim \mathcal{N}(\mu_k, \Sigma)$, where the means of the latent factors, $\mu_k$, define the separation between clusters. The
covariance matrix $\Sigma$ has unit variances and off-diagonal elements $\sigma_{ij} = 0.9$.

In order to induce a covariance structure in the X loadings, we first generate a vector $w^{(k)}$ of
length $p = 200$ where each of the $m = 1, \ldots, p$ elements is sampled from a uniform distribution

$$w^{(k)}_m \sim \begin{cases} \mathrm{Unif}[0,1], & \text{if } m = 1, \ldots, p/2 \\ \mathrm{Unif}[1,2], & \text{if } m = p/2+1, \ldots, p. \end{cases}$$

The p elements of the X loadings and the q elements of the Y loadings are then simulated as
$u^{(k)} \sim \mathcal{N}(0, w^{(k)} w^{(k)\top})$, and $v^{(k)} \sim \mathrm{Unif}[0, 1]$. We then normalise the vectors so that
$\|u^{(k)}\| = \|v^{(k)}\| = 1$. Finally, for $K = 2$, each pair of observations is generated from the TB-PLS model in
the following way

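$$x_i = \begin{cases} t_i^{(1)} u^{(1)\top} + E_{x,i}, & \text{if } i \in C_1 \\ t_i^{(2)} u^{(2)\top} + E_{x,i}, & \text{if } i \in C_2 \end{cases} \qquad y_i = \begin{cases} s_i^{(1)} v^{(1)\top} + E_{y,i}, & \text{if } i \in C_1 \\ s_i^{(2)} v^{(2)\top} + E_{y,i}, & \text{if } i \in C_2 \end{cases}$$

where each element of $E_{x,i} \in \mathbb{R}^{1 \times p}$ and $E_{y,i} \in \mathbb{R}^{1 \times q}$ is sampled i.i.d. from a normal distribution,
$\mathcal{N}(0, \sigma^2)$. The signal to noise ratio (SNR), and thus the geometric separation between clusters, is
decreased by increasing $\sigma^2$.

[FIGURE 2 AROUND HERE]

Figure [2] shows an example of data points generated under this simulation setting; the SNR is
large and the geometric clusters are well separated. As the SNR decreases, the geometric clusters
become less well separated and so this setting tests the suitability of the predictive influence for
clustering when the data is noisy.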
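
The generative procedure of this scenario can be sketched as follows. This is an illustrative reading of the description, not the authors' simulation code: the function name `simulate_scenario_a`, the cluster means passed in `mu`, the noise level `sigma` and the default value of q (which is not stated in this section) are all assumptions. Since the covariance $w w^\top$ has rank one, a draw of the X loadings is obtained here as w multiplied by a standard normal scalar.

```python
import numpy as np

def simulate_scenario_a(mu, n_k=50, p=200, q=200, sigma=0.1, rho=0.9, seed=0):
    """Simulate K geometric clusters, one TB-PLS latent factor per cluster.

    mu is a sequence of K pairs giving the means of (t, s) in each cluster.
    """
    rng = np.random.default_rng(seed)
    Sigma = np.array([[1.0, rho], [rho, 1.0]])   # unit variances, 0.9 off-diagonal
    X_parts, Y_parts, labels = [], [], []
    for k, mu_k in enumerate(mu):
        ts = rng.multivariate_normal(mu_k, Sigma, size=n_k)
        t, s = ts[:, 0], ts[:, 1]                # paired latent factors
        w = np.concatenate([rng.uniform(0, 1, p // 2),
                            rng.uniform(1, 2, p - p // 2)])
        u = w * rng.standard_normal()            # draw from N(0, w w^T)
        v = rng.uniform(0, 1, q)
        u /= np.linalg.norm(u)                   # normalise to unit length
        v /= np.linalg.norm(v)
        X_parts.append(np.outer(t, u) + sigma * rng.standard_normal((n_k, p)))
        Y_parts.append(np.outer(s, v) + sigma * rng.standard_normal((n_k, q)))
        labels.append(np.full(n_k, k))
    return np.vstack(X_parts), np.vstack(Y_parts), np.concatenate(labels)

# Two clusters with hypothetical, well-separated latent means.
X, Y, labels = simulate_scenario_a([(-2.0, -2.0), (2.0, 2.0)], q=150)
```

Increasing `sigma` lowers the signal to noise ratio and therefore the geometric separation between the simulated clusters.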

4.3 Scenario B: predictive clusters 

The second setting directly tests the predictive nature of the algorithm by breaking the link 
between geometric and predictive clusters. In this setting, the geometric position of the clusters 
in X and the predictive relationship between X and Y are no longer related. We start by 
constructing the data as in the previous section for K = 2. However, we now split the first cluster 
in X space into three equal parts and translate each of the parts by a constant $c_1$. For all $i \in C_1$

$$x_i = \begin{cases} x_i + c_1, & \text{if } i = 1, \ldots, n_k/3 \\ x_i, & \text{if } i = n_k/3 + 1, \ldots, 2n_k/3 \\ x_i - c_1, & \text{if } i = 2n_k/3 + 1, \ldots, n_k \end{cases}$$

We then split the second cluster in X space into two equal parts and perform a similar translation
operation with a constant $c_2$. For all $i \in C_2$

$$x_i = \begin{cases} x_i + c_2, & \text{if } i = 1, \ldots, n_k/2 \\ x_i, & \text{if } i = n_k/2 + 1, \ldots, n_k \end{cases}$$



The result is that there are now four distinct geometric clusters in X space but still only two 
clusters which are predictive of the points in Y space. Parametrising the data simulation procedure
to depend on the constants $c_1$ and $c_2$ means that we can generate artificial datasets where one of
the geometric clusters in $C_1$ is geometrically much closer to $C_2$ while the predictive relationship
remains unchanged. We call these structures "confounding clusters".

Figure 3 shows an example of this simulation setting when the SNR is large. In this setting,
noise is only added in the response which preserves the confounding geometric clusters in X but 
removes the separation between clusters in Y. Therefore we expect methods which do not take 
into account predictive influence to fail to recover the true clusters and instead only recover the 
confounding geometric clusters. 
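The translation step that creates the confounding clusters can be sketched as follows. This is a minimal illustration rather than the authors' code: the function name `add_confounding_clusters` is hypothetical, cluster labels are assumed to be coded 0 and 1, and `np.array_split` is used so that cluster sizes that are not divisible by three are split as evenly as possible.

```python
import numpy as np

def add_confounding_clusters(X, labels, c1, c2):
    """Translate parts of each predictive cluster in X (scenario B sketch)."""
    X = np.array(X, dtype=float)           # work on a copy of the predictors
    labels = np.asarray(labels)
    idx1 = np.flatnonzero(labels == 0)     # members of the first predictive cluster
    idx2 = np.flatnonzero(labels == 1)     # members of the second predictive cluster
    first, _, last = np.array_split(idx1, 3)
    X[first] += c1                         # first third moves by +c1
    X[last] -= c1                          # last third moves by -c1, middle third stays
    shifted, _ = np.array_split(idx2, 2)
    X[shifted] += c2                       # first half moves by +c2, second half stays
    return X
```

Only X is modified; Y and the cluster labels are left untouched, so the predictive relationship within each of the two true clusters stays the same while X now contains several geometric groups.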

5 Examples and simulation results 
5.1 Influential observations 

Initially we assess the ability of our criterion to detect influential observations under a TB-PLS
model, and demonstrate why using residuals only is unsuitable. For this assessment, we assume 
a homogeneous population consisting of bivariate points under each view, so p = q = 2. We also 
assume that one latent factor only is needed to explain a large portion of covariance between the 
views. 

In order to generate data under the TB-PLS model, we first create the paired vectors {t, s} 
by simulating n = 100 elements from a bivariate normal distribution with zero mean, unit vari- 
ances and off diagonal elements 0.9. The corresponding factor loadings p and q are simulated 
independently from a uniform distribution, Unif(0, 1). We then randomly select three observations
in the X view and add standard Gaussian noise to each so that the between-view predictive
relationship for those observations is perturbed. Figure 4(a) shows a plot of the predictors X
and the responses Y. The three influential observations are circled in each view. Since these 
observations are only different in terms of their predictive relationships, they are undetectable by 
visually exploring this scatter plot. 

Using all 100 points, we fit a TB-PLS model with R = 1 and compute both the residual 
error and the predictive influence of each observation. In Figure 4(b), the observations in X
are plotted against their corresponding residuals (shown in the left-hand plot) and predictive 
influences (shown in the right-hand plot). Since TB-PLS aims to minimise the residual error of 
all observations, including the influential observations results in a biased model fit; although the 
influential observations exhibit large residuals, this is not sufficient to distinguish them from non- 
influential observations. On the other hand, the predictive influence of each point is computed by 
implicitly performing leave-one-out cross validation and, as a consequence of this, the predictive 
influence of those points is larger than that of any of the other points. This simple example 
provides a clear indication that the influential observations can be identified by comparing the 
relative predictive influence between all points. 

We also perform a more systematic and realistic evaluation in higher dimensions. For this 
study, we simulate 300 independent datasets, whereby each dataset has p = q = 200, n = 100 
and three influential observations. We follow a similar simulation procedure as the one described 
before, and set R = 1. Once the TB-PLS model has been fitted, all points are ranked in decreasing
order of predictive influence and, separately, in decreasing order of residual magnitude. For each
ranking we select the top m ranked observations (with m = 1, ..., n) and define a true positive as
any truly influential observation that is among the selected ones; all other observations among
those m are considered false positives.

Figure 5 compares the receiver operating characteristic (ROC) curve obtained using the predictive
influence and the residual error for this task. This figure shows that the predictive influence
consistently identifies the true influential observations with fewer false positives than when the 
residual is used. This suggests that using the residuals to detect influential observations in high di- 
mensions is approximately equivalent to a random guess, and clearly demonstrates the superiority 
of the proposed predictive influence measure for this task. 
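The ranking-based evaluation described above can be written compactly. The sketch below assumes that a per-observation score (predictive influence or residual magnitude) and the ground-truth influential flags are already available; the function name `roc_from_scores` is hypothetical.

```python
import numpy as np

def roc_from_scores(scores, is_influential):
    """Return (fpr, tpr) after selecting the top m observations, for m = 1, ..., n."""
    scores = np.asarray(scores, dtype=float)
    truth = np.asarray(is_influential, dtype=bool)
    order = np.argsort(-scores, kind="stable")   # largest score first
    hits = truth[order]
    true_pos = np.cumsum(hits)                   # influential points among the top m
    false_pos = np.cumsum(~hits)                 # non-influential points among the top m
    tpr = true_pos / truth.sum()
    fpr = false_pos / (~truth).sum()
    return fpr, tpr
```

Averaging the resulting curves over independently simulated datasets gives a Monte Carlo ROC curve of the kind compared in this section.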

5.2 Confounding geometric structures 

Here we revisit the illustrative example described in Section 2.2. The predictors shown in Figure
1 consist of three-dimensional points sampled uniformly along a line and a plane, and these two
subspaces intersect. The response consists of a noisy linear combination of the points in each 
cluster. Using the same simulated data, we can explore the performance of both our MVPP 
algorithm and a different multi-view clustering algorithm, MV-CCA [9]. MV-CCA fits a single, 
global CCA model which assumes all points belong to the same low dimensional subspace and 
that clusters are well separated geometrically in this subspace. 

Figure 6 shows the clustering results on this example dataset using both MV-CCA and MVPP.
The result of clustering using MV-CCA shown in Figure 6(a) highlights the weakness of using
a global, geometric distance-based method, since the existence of clusters is only apparent if local
latent factors are estimated using the correct subset of data. MV-CCA fits a single plane to the
data which is similar to the one estimated by a global TB-PLS model, as in Figure 1(b). The
points are then clustered based on their geometric separation on that plane, which results in an
incorrect cluster allocation.

In comparison, Figure 6(b) shows the result of clustering with MVPP, which is able to recover
the true clusters, and therefore deal with the confounding geometric structures, by inferring the
true underlying predictive models. Moreover, since the noise in the data is low, in this example
we are able to let MVPP learn the true number of latent factors in each cluster using the
procedure described in Section 3.3.



5.3 Clustering performance 

Using data simulated under scenarios A and B, we assess the mean clustering and predictive 
performance of the MVPP algorithm in comparison to some multi-view clustering algorithms 
over 200 Monte Carlo simulations. In each simulation, the latent factors, loadings and noise are 
randomly generated as described in Section 4. We also examine issues relating to model selection
in the MVPP algorithm. 

Figure 7 shows the result of the comparison of clustering accuracy between methods when
K = 2 in scenario A. An SNR of $10^{0.1}$ indicates that the signal variance is approximately 1.3 times
the noise variance, so the clusters in both views are well separated, whereas an SNR of
$10^{-0.5}$ indicates that the clusters overlap almost completely. It can be seen that when the noise
level is low, MVPP is able to correctly recover the true clusters. As the noise increases, and
the geometric separation between clusters is removed, the clustering accuracy of the competing
methods decreases at a faster rate than that of MVPP.
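The text does not spell out how clustering accuracy is scored. A common convention, assumed here purely for illustration, is to report the fraction of correctly assigned points under the best one-to-one relabelling of the recovered clusters, since cluster labels are arbitrary. The function name `clustering_accuracy` is hypothetical, labels are assumed to be integers in 0, ..., K-1, and the exhaustive search over permutations is only practical for small K.

```python
from itertools import permutations

import numpy as np

def clustering_accuracy(true_labels, pred_labels, n_clusters):
    """Best agreement between true and predicted labels over all relabellings."""
    true_labels = np.asarray(true_labels)
    pred_labels = np.asarray(pred_labels)
    best = 0.0
    for perm in permutations(range(n_clusters)):
        relabelled = np.array(perm)[pred_labels]   # rename predicted cluster j to perm[j]
        best = max(best, float(np.mean(relabelled == true_labels)))
    return best
```

Under this convention, a two-cluster partition that is unrelated to the truth scores about 0.5 rather than 0 when the clusters are of equal size.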

Since MV-CCA assumes that the clusters are well separated geometrically, as the noise in- 
creases the estimated latent factor is biased which decreases the separation between the clusters. 
Another reason for the difference in performance between MV-CCA and MVPP lies with how the 
multiple views are used for clustering. Although MV-CCA clustering derives a low dimensional 
representation of the data using both views, the actual clustering is performed using the latent 
factors of only one view. MVPP considers the important predictive contribution from both views 
in constructing the predictive influences and so clustering occurs jointly between the views. 

The MV-kernel method [10] relies on the Euclidean distance between points in constructing the
similarity matrix. This method works well only when the clusters are well separated in each view. 
Computing the Euclidean distance between points in high dimensions before performing dimen- 
sionality reduction means that the MV-kernel method is affected by the curse of dimensionality. 
As such, its performance degrades rapidly as the SNR decreases. 

WCC [17] clusters each view separately using K-means and combines the partitions to obtain
a consensus. Since it does not take into account the relationship between the two views, when the 
data is noisy this can result in two extremely different partitions being recovered in each view and 
therefore a poor consensus clustering. 

Figure 8 shows the result of the comparison between methods in scenario B. It can be seen that
MVPP consistently clusters the observations correctly in this challenging setting and is extremely 
robust to noise due to the implicit use of cross-validation. Since none of the other methods 
takes into account the predictive relationship between the clusters and instead only find geometric 
clusters, they all consistently fail to identify the true clusters. The similar performance for low 
levels of noise corresponds to these methods consistently misclustering the points based on their 
geometric position. As the noise increases, the performance of WCC, MV-CCA and MV-kernel 
remains fairly constant. This confirms that these methods are not correctly utilising the important 
information in the second view of data even when the predictive clusters in the response are well 
separated.

5.4 Predictive performance

Since only MVPP considers the predictive performance of the clustering by evaluating the PRESS 
error in each cluster, in order to test the predictive performance of the competing multi-view 
clustering algorithms we must perform clustering and prediction in two steps. Therefore we first 
perform clustering with each of the methods on the full dataset and then train a TB-PLS model in 
each of the obtained clusters. We then test the predictive ability by evaluating the leave-one-out 
cross validation error within each cluster. For comparison, we also evaluate the LOOCV error of 
a global TB-PLS model which we fit using all of the data. 
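The two-step evaluation can be sketched with an explicit leave-one-out loop. The one-factor fit below is an assumption made for illustration: the weights $u$ and $v$ are taken as the leading singular vectors of $X^\top Y$, the inner coefficient is $g = t^\top s / t^\top t$ with $t = Xu$ and $s = Yv$, and the Y loading is $q = Y^\top s / s^\top s$, giving $\beta = u g q^\top$. All function names are hypothetical, and the brute-force refitting shown here is not the closed-form PRESS used by MVPP.

```python
import numpy as np

def fit_tbpls_one_factor(X, Y):
    """One-factor TB-PLS regression coefficients beta (p x q), under the stated assumptions."""
    U, _, Vt = np.linalg.svd(X.T @ Y, full_matrices=False)
    u, v = U[:, 0], Vt[0]                 # weights maximising the covariance of Xu and Yv
    t, s = X @ u, Y @ v                   # latent factors
    g = (t @ s) / (t @ t)                 # inner regression of s on t
    q = (Y.T @ s) / (s @ s)               # Y loading
    return g * np.outer(u, q)

def loocv_error(X, Y):
    """Mean squared leave-one-out prediction error of a single TB-PLS model."""
    n = X.shape[0]
    errors = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i
        beta = fit_tbpls_one_factor(X[keep], Y[keep])
        errors[i] = np.sum((Y[i] - X[i] @ beta) ** 2)
    return errors.mean()

def clusterwise_loocv_error(X, Y, labels):
    """Leave-one-out error when a separate model is fitted within each cluster."""
    labels = np.asarray(labels)
    total = 0.0
    for k in np.unique(labels):
        members = labels == k
        total += loocv_error(X[members], Y[members]) * members.sum()
    return total / labels.size
```

Calling `loocv_error(X, Y)` on the full dataset gives the global baseline, while `clusterwise_loocv_error(X, Y, labels)` scores a partition returned by any of the clustering methods.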

Figure 9 shows the predictive performance under scenario A. This figure shows that
MVPP achieves the lowest prediction error amongst the multi-view clustering methods. This is
to be expected since the clusters are specifically obtained such that they are maximally predictive
through implicit cross validation. The prediction error of the competing multi-view methods is
larger than that of MVPP, which indicates that these methods are not selecting the truly predictive
clusters. As the noise increases, the prediction performance of all methods decreases; however, as
MVPP is more robust to noise than the competing methods, its relative decrease in performance
is smaller. It can be noted that for low levels of noise the global predictive model performs worst
of all. This further supports the notion of attempting to uncover locally predictive models within
the data.

Figure 10 shows the prediction performance in scenario B. Since MVPP is able to accurately
recover the predictive clusters, it displays the lowest prediction error amongst the multi-view
clustering methods. As noted above, the other multi-view clustering methods only recover the
geometric clusters and so their prediction performance is worse. The relative performance difference
between the competing methods stays similar as noise increases; however, since MVPP is affected by
noise in Y, its predictive performance decreases relative to the other methods.

5.5 Model selection

The ability of MVPP to learn the true number of clusters in the data is assessed using the procedure
in Section 3.3. In this experiment, the data was simulated under setting A and the true number of
clusters was set as K = 2. Figure 11 shows a comparison between the PRESS prediction error and
the objective function for different values of K over 200 Monte Carlo simulations. As expected, 
the objective function decreases monotonically as K is increased whereas the PRESS exhibits a 
global minimum at K = 2. 

In the above simulation settings, the number of latent factors was fixed to be R = 1. According 
to the TB-PLS model in Section 2.1, the first latent factor is the linear combination of each of the
views which explains maximal covariance between X and Y . Therefore, the first latent factor is 
the most important for prediction. Each successive latent factor explains a decreasing amount of 
the covariance between the views and so contributes less to the predictive relationship. 

Figure 12 shows the effect of the number of latent factors, R, on the clustering accuracy of
MVPP in scenario A. It can be seen that for low levels of noise, when the clusters are well 
separated, increasing R has little effect on the clustering accuracy. As the noise increases, the first 
latent factor appears to capture all of the important predictive relationships in the data whereas 
subsequent latent factors only fit the noise which causes a detrimental effect on the clustering 
accuracy as more latent factors are added. 

6 An application to web data
6.1 Data description 

The proposed MVPP method, as well as alternative multi-view clustering algorithms, has been
tested on two real world datasets. The first is the WebKB dataset¹, which consists of a collection
of interconnected web pages taken from the computer science departments of four universities:
Cornell, Texas, Washington and Wisconsin. This dataset is commonly used to test multi-view
clustering algorithms, and therefore provides an ideal benchmark [3, 13, 10]. We treat each web
page as an observation, where the predictor vector, $x_i$, is the page text and the response vector,
$y_i$, is the text of the hyperlinks pointing to that page. The dimensions of these vectors for each
university are given in Table 1. Both views of the pages consist of a bag of words representation of
the important terms where the stop words have been removed. Each word has been normalised
according to its frequency in each page and the inverse of the frequency at which pages containing
that word occur (term frequency-inverse document frequency normalisation).

¹ http://www.cs.cmu.edu/~webkb
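The term frequency-inverse document frequency normalisation can be illustrated with a small sketch. The exact weighting variant applied to the benchmark data is not stated, so the version below is an assumption: term frequency is the relative frequency of a word within a page, and inverse document frequency is the logarithm of the number of pages divided by the number of pages containing the word. The function name `tfidf` is hypothetical.

```python
import numpy as np

def tfidf(counts):
    """TF-IDF weights for a (pages x words) matrix of raw word counts."""
    counts = np.asarray(counts, dtype=float)
    n_pages = counts.shape[0]
    page_totals = counts.sum(axis=1, keepdims=True)
    tf = counts / np.maximum(page_totals, 1.0)       # relative frequency within each page
    doc_freq = (counts > 0).sum(axis=0)              # number of pages containing each word
    idf = np.log(n_pages / np.maximum(doc_freq, 1))  # rare words receive larger weights
    return tf * idf
```

With this variant, a word that appears on every page receives weight zero, which down-weights terms that carry little information about the page category.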

There are two separate problems associated with the WebKB dataset. The first problem, 
which we denote WebKB-2, involves clustering the pages into two groups consisting of "course" 
and "non-course" related pages respectively. The second problem, WebKB-4, involves clustering 
the pages into four groups consisting of "course" , "student" , "staff" and "faculty" related pages. 
It is known that a predictive relationship exists between views [27] and so we expect the results 
obtained by MVPP to reflect the ability to exploit that relationship in order to correctly identify 
clusters. 

We also evaluate the clustering and prediction performance of MVPP and competing methods 
on a second benchmark dataset, the Citeseer dataset [3], which consists of scientific publications
(n = 3312) belonging to one of six classes of approximately equal sizes. The predictor view, $x_i$,
consists of a bag of words representation of the text of each publication in the same form as the
WebKB dataset (p = 3703). We perform two different analyses: in the first one, the response
view $y_i$ comprises a binary vector of the incoming references between a paper and the other
publications in the dataset (q = 2316); in the second, the response view comprises a binary
vector of the outgoing references from each paper (q = 1960).



[TABLE 1 AROUND HERE]



6.2 Experimental results 

For the WebKB-2 clustering problem there are two true clusters of approximately equal size. 
We again compare MVPP with the WCC, MV-CCA and MV-Kernel clustering methods. For 
each method, we then evaluate the leave-one-out prediction error for the previously recovered 
clusterings. We also evaluate the leave-one-out prediction error for global PLS which has been 
estimated using all the data.

[TABLE 2 AROUND HERE]

Table 2 shows the results of clustering and prediction on the WebKB-2 dataset. In all cases,
MVPP achieves almost 100% clustering accuracy whereas the other methods achieve between
50% and 87% accuracy, which suggests that there is a predictive relationship between the text view of
the webpage and the incoming links which MVPP is able to exploit to recover the true clusterings.
MVPP also achieves a much lower prediction error than the other clustering methods, whose errors
vary widely. This suggests that since the dimensionality of the problem is large, a small error in
cluster assignment can lead to fitting a poor predictive model.

[TABLE 3 AROUND HERE]

For the WebKB-4 clustering problem there are four true clusters where one of the clusters
is much larger than the others. This poses a particularly challenging scenario since K-means
based techniques favour clusters which are of a similar size. Table 3 details the results on this
dataset. Again, in all cases, MVPP achieves the highest clustering accuracy. In this dataset, the
clustering accuracy for MVPP is approximately 15% lower than for K = 2 due to the irregular
cluster sizes and the poorer separation between clusters. The other methods also generally achieve
poorer clustering accuracy; however, the relative decrease is not as large. As with the previous
dataset, better clustering performance of the competing multi-view methods does not necessarily
yield better prediction performance. Despite achieving a relatively poorer clustering accuracy, fitting
four clusters instead of two greatly improves the prediction performance of the clustering methods
in most cases.

[TABLE 4 AROUND HERE]

Table 4 shows the results for clustering and prediction using the Citeseer dataset. It can be
seen that in both configurations, MVPP achieves the highest clustering accuracy although the
relative difference is not as large as for the WebKB dataset. In this case, MVPP achieves the lowest
prediction error of all methods. The large variance in prediction error between the multi-view
clustering methods despite their similar clustering accuracy again suggests that incorrectly
clustering observations can severely affect the prediction performance due to the high dimensionality
of the data.



7 Conclusions 

In this work, we have considered the increasingly popular situation in machine learning of identi- 
fying clusters in data by combining information from multiple views. We have highlighted some 
cases where the notion of a predictive cluster can better uncover the true partitioning in the data. 
In order to exploit this, our work consolidates the notions of prediction and cluster analysis, which
were previously mostly considered separately in the multi-view learning literature.

In order to identify the true predictive models in the data, we have developed a novel method 
for assessing the predictive influence of observations under a TB-PLS model. We then perform 
multi-view clustering based on grouping together observations which are similarly important for 
prediction. The resulting algorithm, MVPP, is evaluated on data simulated under the TB-PLS 
model such that the true clusters are predictive rather than geometric. The results demonstrate 
how geometric distance based multi-view clustering methods are unable to uncover the true parti- 
tions even if those methods explicitly assume the data is constructed using latent factors. On the 
other hand, MVPP is able to uncover the true clusters in the data to a great degree of accuracy 
even in the presence of noise and confounding geometric structure. Furthermore, the clusters 
obtained by MVPP provide the basis of a better predictive model than the clusters obtained by 
the competing methods. An application to real web page and academic paper data shows similar
results.

The computational cost of MVPP is at least K times that of
the other CCA-based multi-view clustering algorithms. This computational cost is incurred due
to the need to iteratively fit a predictive model in each cluster, which can be expensive when
the dimensionality of the data is high. However, as shown by our results on simulated and real 
datasets, it appears that such a strategy is necessary in order to recover an accurate partitioning 
of the data. 

Determining an initial partitioning which performs better than a random initialisation is a 
difficult problem since identifying the local predictive relationships a priori is not always possible 
using global methods. An obvious choice would be to initialise the algorithm using the results from
the MV-CCA method. However, it can be seen that in certain situations (such as scenario B),
MV-CCA results in a poor cluster assignment which may result in the MVPP algorithm getting 
stuck in a poor local minimum. 

We have also attempted to unify the difficult issues of model selection in clustering and TB- 
PLS which have previously only been considered separately. We have shown that our prediction 
based clustering criterion can be used to learn the true number of clusters. However, we have also 
seen that learning the number of latent factors in each of the TB-PLS models remains a difficult 
problem due to the effects of noise and the iterative nature of the algorithm. 

The idea of multi-view clustering based on prediction has not been explored before in the 
literature, but there are examples of clustering using mixtures of linear regression models in which 
the response is univariate [S]. However, it is well known that the least squares solution is prone 
to over-fitting and does not represent the true predictive relationship inherent between the views. 
Furthermore, the least squares regression applies only to a univariate response variable, and is not 
suitable for situations where the response is high dimensional. 

A possible extension to the MVPP method for high dimensional and noisy data is to apply
an additional constraint on the $\ell_1$ norm of the TB-PLS weights estimated in Eq. ( ). Such a
constraint induces sparsity in the TB-PLS solution such that only a small number of variables
contribute to the predictive relationship between the views. This can be achieved, for example, 
using the Sparse PLS method of [21]. However, this requires the specification of additional tuning 
parameters which cannot be easily learned from the data. 
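One common way to impose an $\ell_1$ penalty on the leading singular vectors of $X^\top Y$ is to soft-threshold the weights inside a power iteration. The sketch below illustrates that general idea only; it is not necessarily the procedure of [21], the function names are hypothetical, and the threshold `lam` is exactly the kind of additional tuning parameter mentioned above. Here `lam` is on the scale of the entries of $X^\top Y v$.

```python
import numpy as np

def soft_threshold(a, lam):
    """Shrink each entry of a towards zero by lam; entries smaller than lam become exactly zero."""
    return np.sign(a) * np.maximum(np.abs(a) - lam, 0.0)

def sparse_first_weights(X, Y, lam, n_iter=50):
    """Sparse X weights u and dense Y weights v for the first latent factor (illustrative)."""
    M = X.T @ Y
    _, _, Vt = np.linalg.svd(M, full_matrices=False)
    v = Vt[0]                                   # start from the unpenalised solution
    for _ in range(n_iter):
        u = soft_threshold(M @ v, lam)          # penalised update of the X weights
        norm_u = np.linalg.norm(u)
        if norm_u == 0.0:
            raise ValueError("lam is too large: every X weight was shrunk to zero")
        u /= norm_u
        v = M.T @ u                             # unpenalised update of the Y weights
        v /= np.linalg.norm(v)
    return u, v
```

Variables whose weight is exactly zero in `u` do not contribute to the latent factor, which is the sparsity effect described in the text.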



A PLS PRESS 

For one latent factor we can write the $i$th leave-one-out error as $e_{-i} = y_i - x_i \beta_{-i}$, where $\beta_{-i}$ is
estimated using all but the $i$th observation. Since $\beta = u g q^\top$, we can write $e_{-i} = y_i - x_i u_{-i} g_{-i} q_{-i}^\top$.
The difference between the singular vectors estimated using all the data and the leave-one-out
estimate, $\|u - u_{-i}\|$, shrinks as $n$ grows, so that if $n$ is large we can write
$e_{-i} = y_i - x_i u g_{-i} q_{-i}^\top$.

Using the matrix inversion lemma, we can obtain a recursive update equation for $g_{-i}$ which
depends only on $g$ and does not require an explicit leave-one-out step, in the following way

9-i = 9 — ^ > ( 17 ) 

where the expression for g is given by Equation (JsJ) . In the same way we derive the following 
expression = q — wzzl*g_Kf s ) — ?i j where the expression for q is given by Equation Q. 



Equation (11) is then obtained by using these values for $g_{-i}$ and $q_{-i}$ in $\beta_{-i} = u g_{-i} q_{-i}^\top$ and
simplifying.

B Predictive influence



Taking the partial derivative of the PRESS function, $J$, with respect to $x_i$



dJ 19||e_ 
Taking derivatives of the constituent parts of $e_{-i}$ in Equation (11) with respect to $x_i$ we obtain
d 



dx. 



d 
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We can now obtain - by combining Equations (j 1 S[) so that 

d 
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dx. 



2eJ 



where e 4 _j = ;) . Finally, 
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The derivation of predictive influence with respect to yi follows the same argument and so the 



predictive influence, Tv x y (x i ,y i ) = [ 
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dxi ' J 
(b) Global TB-PLS.

Figure 1: Figure 1(a) shows that the two clusters in the X view consist of points sampled uniformly
on a line and a plane embedded in three dimensions. The clusters in the Y view are noisy linear
combinations of the corresponding clusters in the X view so that there is a predictive relationship
between the views. Figure 1(b) shows the result of fitting a global TB-PLS model to the data. It
can be seen that the resulting subspace in the X view lies between the clusters and as a result
few of the observations in the response lie on the estimated subspace.

Figure 2: An example of data generated in scenario A where the clusters are "geometric clusters",
i.e. the Euclidean distance between points within clusters is small compared to points between
clusters. The predictors $X \in \mathbb{R}^{100 \times 200}$ and response $Y \in \mathbb{R}^{100 \times 200}$ have been plotted in the
projected space.

Figure 3: An example of data generated in scenario B. Points in the predictive clusters in X
have been translated to create four clear geometric clusters. In this case, in the X view, the
distance between cluster two (crosses) and two of the geometric clusters from cluster one (dots) is
smaller than the distance between the points in cluster one. This implies that Euclidean distance
based clustering will fail to recover the true clusters. The predictors $X \in \mathbb{R}^{100 \times 200}$ and response
$Y \in \mathbb{R}^{100 \times 200}$ have been plotted in the projected space.

(a) Two-dimensional predictors and responses generated under the TB-PLS model. The influential observations are
circled.

(b) Two-dimensional predictors plotted against their corresponding magnitude residual error and predictive influence,
respectively.

Figure 4: 4(a) shows the two dimensional predictors, X, and responses, Y, with the influential
observations circled. It is clear that the influential observations cannot be identified by simply
examining these scatter plots. 4(b) shows the magnitude residual (left-hand plot) and predictive
influence (right-hand plot) for each observation in X. The predictive influence of the influential
observations is much larger than that of all other observations so that these points are clearly
identified. The same degree of separation is not evident by examining the magnitude residual
error.

[Figure 5 plot. Legend: predictive influence, residual. Horizontal axis: false positive rate.]

Figure 5: Receiver operating characteristic (ROC) curve which compares the ability of the
predictive influence and the residual to detect outliers in high dimensions (p = q = 200). The results
are averaged over 300 Monte Carlo simulations. Using the predictive influence to detect influential
observations consistently identifies more true positives for a given false positive rate than using
the residual. The predictive influence detects all influential observations with a false positive rate
of 0.34 whereas the residual consistently identifies almost as many false positives as true positives.



                        Observations                 View 1   View 2
University    Course  Student  Staff  Faculty         (p)      (q)
Cornell           83       18     38       32        1703      694
Texas            103       18     33       31        1703      660
Washington       106       19     65       27        1703      715
Wisconsin        116       22     70       34        1703      745

Table 1: A summary of the number of observations and variables in the different configurations
of the WebKB dataset.

(b) Clustering using MVPP

Figure 6: Plot (a) shows the result of clustering the example dataset introduced in Figure 1 using
the MV-CCA method. It can be seen that MV-CCA fits a single plane to the data and assigns
points to clusters based on geometric distances between points on that plane so the resulting
clustering is incorrect. Plot (b) shows the result of clustering using the MVPP algorithm which
models the predictive relationship within each cluster. As a result, the true subspaces and cluster
assignments are recovered.

Figure 7: Comparing the mean clustering accuracy of different methods for K = 2 in simulation
setting A over 200 Monte Carlo simulations. When the SNR is high, MVPP achieves maximum
accuracy and as the noise increases, the decrease in performance is small relative to the other
methods.




Figure 8: Comparing the mean clustering accuracy in simulation setting B over 200 Monte Carlo
simulations. MVPP achieves a high clustering accuracy for all levels of noise whereas the competing
methods perform poorly even when the SNR is large, since they recover clusters based on the
confounding geometric structure in the X view.

Figure 9: Comparing the mean leave-one-out prediction error over 200 Monte Carlo simulations of
the clusters obtained by different methods for K = 2 in simulation setting A. MVPP consistently
achieves the lowest prediction error of the multi-view clustering methods due to the clusters being
selected based on their predictive ability. Similarly to the clustering performance, as the noise
increases the relative difference between MVPP and the other methods also increases. It can be
seen that all clustering methods achieve better prediction than a global PLS model.




Figure 10: Comparing the mean leave-one-out prediction error of the clusters obtained in simula- 
tion setting B over 200 Monte Carlo simulations. MVPP achieves the best prediction performance 
of the multi-view clustering methods. Since as noise increases, the relative clustering clustering 
performance between MVPP and the competing methods decreases, this relative predictive per- 
formance of MVPP also decreases. Again, global PLS achieves the worst prediction accuracy of 
all methods. 

Figure 11: Comparing the prediction error with the objective function for different values of K 
in the first simulation setting, where the true value of K is 2. The PRESS attains its global 
minimum at K = 2, whereas the objective function decreases monotonically as K increases 
because it begins to overfit the data. The error bars also show that the standard deviation 
of the PRESS is smallest when K = 2. This allows us to use the prediction error to select the true 
number of clusters. 
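
The selection rule described in this caption can be sketched in Python as follows. This is a minimal illustration, assuming that a PRESS value has already been computed for each candidate K in each simulation run; the function name `select_num_clusters` is hypothetical, and the numbers are placeholders rather than results from the experiments.

```python
import numpy as np


def select_num_clusters(press_by_k):
    """Pick the K with the smallest mean PRESS.

    press_by_k maps each candidate K to the PRESS values obtained over
    repeated runs. Returns the selected K and a (mean, std) summary per K.
    """
    summary = {k: (float(np.mean(v)), float(np.std(v)))
               for k, v in press_by_k.items()}
    best_k = min(summary, key=lambda k: summary[k][0])
    return best_k, summary


# Placeholder values for illustration only (not taken from the paper).
press_by_k = {
    1: [9.1, 8.7, 9.4],
    2: [3.0, 3.1, 2.9],
    3: [3.6, 4.4, 3.1],
    4: [4.2, 5.5, 3.3],
}
best_k, summary = select_num_clusters(press_by_k)
for k, (mean, std) in sorted(summary.items()):
    print(f"K={k}: mean PRESS={mean:.2f}, std={std:.2f}")
print("selected K:", best_k)
```

Unlike the within-sample objective, which keeps decreasing as K grows, a leave-one-out criterion of this kind penalises overfitting, so its minimum can serve as a model selection rule.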

Figure 12: The effect of the number of latent factors, R, on the clustering accuracy. For low levels 
of noise, increasing R has little effect on the clustering accuracy. However, as the noise increases, 
the first latent factor explains all of the signal in the data and increasing R 
has a detrimental effect on the clustering accuracy. 

University  Metric  Global PLS  WCC     MV-CCA  MV-Kernel  MVPP
Cornell     Acc     -           0.50    0.56    0.65       0.96
            Error   163.69      46.71   137.65  159.93     37.35
Texas       Acc     -           0.50    0.57    0.71       0.95
            Error   177.50      40.90   132.01  173.79     33.74
Washington  Acc     -           0.87    0.79    0.69       0.97
            Error   209.40      46.44   106.86  109.16     31.53
Wisconsin   Acc     -           0.67    0.76    0.59       0.98
            Error   234.16      68.86   171.58  244.85     55.72

Table 2: The clustering accuracies (Acc) and mean squared leave-one-out prediction error on the 
WebKB-2 dataset. MVPP consistently recovers the true clusters with high accuracy and therefore also 
obtains the lowest prediction error. The large variance in prediction error between the other 
methods demonstrates the importance of fitting the correct local models. 
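
For readers who wish to reproduce an accuracy figure of this kind, the following Python sketch shows one common convention, assumed here rather than taken from the text: since cluster labels are arbitrary, the predicted labels are relabelled in every possible way and the best agreement with the true labels is reported. The function name `clustering_accuracy` is hypothetical, and the exhaustive search over permutations is only practical for a small number of clusters.

```python
from itertools import permutations

import numpy as np


def clustering_accuracy(true_labels, pred_labels):
    """Best fraction of matching labels over all relabellings of pred_labels.

    Assumes integer labels and a small number of clusters, since every
    permutation of the label set is tried.
    """
    true_labels = np.asarray(true_labels)
    pred_labels = np.asarray(pred_labels)
    ids = np.union1d(true_labels, pred_labels)
    best = 0.0
    for perm in permutations(ids.tolist()):
        mapping = dict(zip(ids.tolist(), perm))
        relabelled = np.array([mapping[p] for p in pred_labels.tolist()])
        best = max(best, float(np.mean(relabelled == true_labels)))
    return best


# Labels swapped but partition identical: perfect recovery.
print(clustering_accuracy([0, 0, 1, 1], [1, 1, 0, 0]))
# Partition unrelated to the truth: chance level for two balanced clusters.
print(clustering_accuracy([0, 0, 1, 1], [0, 1, 0, 1]))
```

Under this convention the accuracy for two clusters cannot fall below 0.5, because swapping the two labels turns any agreement rate below one half into one above it.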



University  Metric  Global PLS  WCC     MV-CCA  MV-Kernel  MVPP
Cornell     Acc     -           0.70    0.69    0.44       0.83
            Error   163.69      35.43   19.77   105.04     17.89
Texas       Acc     -           0.58    0.68    0.41       0.86
            Error   177.50      54.35   26.21   141.34     18.97
Washington  Acc     -           0.70    0.68    0.53       0.75
            Error   209.40      36.14   33.26   98.58      17.34
Wisconsin   Acc     -           0.69    0.74    0.53       0.85
            Error   234.16      61.61   31.27   110.89     21.13

Table 3: The clustering accuracies (Acc) and mean squared leave-one-out prediction error on 
the WebKB-4 dataset. MVPP again achieves the best clustering and prediction performance. 
Although the clustering accuracy is worse than in the WebKB-2 configuration, the improved 
prediction performance suggests that fitting four clusters is a more accurate model of the data. 

Configuration    Metric  Global PLS  WCC     MV-CCA  MV-Kernel  MVPP
Text + Inbound   Acc     -           0.76    0.76    0.73       0.81
                 Error   344.06      70.62   76.50   110.30     39.51
Text + Outbound  Acc     -           0.76    0.76    0.72       0.87
                 Error   278.53      110.46  84.50   73.95      52.96

Table 4: The clustering accuracies (Acc) and mean squared leave-one-out prediction error on the 
Citeseer dataset. MVPP achieves the best clustering accuracy and prediction error whereas the 
other methods all achieve a similar clustering accuracy. 